Writing Functions

Tuesday, February 20

Today we will…

  • New Material
    • Function Basics
    • Variable Scope + Environment
  • Work time:
    • PA 7: Writing Functions

Why write functions?

Functions allow you to automate common tasks!

  • We’ve been using functions since Day 1, but when we write our own, we can customize them!
  • Have you found yourself copy-pasting code and only changing small parts?

Writing functions has 3 big advantages over copy-paste:

  1. Your code is easier to read.
  2. To change your analysis, simply change one function.
  3. You avoid mistakes from copy-paste.

Function Basics

Function Syntax


A (very) Simple Function

Let’s define the function.

  • You only need to run the code that defines the function once per R session.
add_two <- function(x){
  x + 2
}


Let’s call the function!

add_two(5)
[1] 7

Naming: add_two <-

The name of the function is chosen by the author.

add_two <- function(x){
  x + 2
}

Caution: Function names have no inherent meaning.

  • The name you give to a function does not affect what the function does.
add_three <- function(x){
  x + 7
}
add_three(5)
[1] 12

Arguments

The argument(s) of the function are chosen by the author.

  • Arguments are how we pass external values into the function.
  • They are temporary variables that only exist inside the function body.
  • We give them general names:
    • x, y, z – vectors
    • df – dataframe
    • i, j – indices


add_two <- function(x){
  x + 2
}

Arguments

If we supply a default value when defining the function, the argument is optional when calling the function.

add_something <- function(x, something = 2){
  return(x + something)
}
  • If a value is not supplied, something defaults to 2.
add_something(x = 5)
[1] 7
add_something(x = 5, something = 6)
[1] 11

If we do not supply a default value when defining the function, the argument is required when calling the function.

add_something <- function(x, something){
  x + something
}

add_something(x = 2)
Error in add_something(x = 2): argument "something" is missing, with no default
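
Arguments can be matched by position or by name. A small sketch, reusing the version of add_something() that has a default value (the object names by_position, by_name, and with_default are only for illustration):

```r
# `something` is optional here and defaults to 2.
add_something <- function(x, something = 2) {
  return(x + something)
}

# Matched by position: first value goes to x, second to something.
by_position <- add_something(5, 6)

# Matched by name: the order no longer matters.
by_name <- add_something(something = 6, x = 5)

# `something` is omitted, so the default of 2 is used.
with_default <- add_something(5)

by_position   # 11
by_name       # 11
with_default  # 7
```

Naming arguments is usually easier to read once a function has more than one or two of them.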

Body: { }

The body of the function is where the action happens.

  • The body must be specified within a set of curly brackets.
  • The code in the body will be executed (in order) whenever the function is called.
add_two <- function(x){
  x + 2
}

Output: return()

By default, your function gives back the value of the last expression in its body, which is what would normally print out…

add_two <- function(x){
  x + 2
}


7 + 2
[1] 9
add_two(7)
[1] 9


…but it’s better to be explicit and use return().

add_two <- function(x){
  return(x + 2)
}
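
One place where return() really matters is leaving a function early. A minimal sketch, using a made-up function safe_divide() that is not part of the course materials:

```r
# Hypothetical example: return() ends the function immediately,
# so any code after it is skipped.
safe_divide <- function(x, y) {
  if (y == 0) {
    return(NA_real_)   # exit early; the division below never runs
  }
  return(x / y)
}

safe_divide(6, 3)  # 2
safe_divide(1, 0)  # NA
```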

Input Validation

When a function requires an input of a specific data type, check that the supplied argument is valid.

add_something <- function(x, something){
  stopifnot(is.numeric(x))
  return(x + something)
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): is.numeric(x) is not TRUE
add_something <- function(x, something){
  stopifnot(is.numeric(x),
            is.numeric(something)
            )
  return(x + something)
}


add_something(x = "dog", something = 2)
Error in add_something(x = "dog", something = 2): is.numeric(x) is not TRUE
add_something(x = 2, something = "something")
Error in add_something(x = 2, something = "something"): is.numeric(something) is not TRUE
add_something <- function(x, something){
  if(!is.numeric(x)){
    stop("Please provide a numeric input for the x argument.")
  }
  return(x + something)
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): Please provide a numeric input for the x argument.
add_something <- function(x, something){
  if(!is.numeric(x) || !is.numeric(something)){
    stop("Please provide numeric inputs for both arguments.")
  }
  return(x + something)
}

add_something(x = 2, something = "R")
Error in add_something(x = 2, something = "R"): Please provide numeric inputs for both arguments.
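
A third option sits between the two styles above: in R 4.0 or later, stopifnot() accepts named conditions, and the name is used as the error message. A sketch, with message wording chosen only for illustration:

```r
add_something <- function(x, something) {
  # Each name is the message shown when its condition is not TRUE.
  stopifnot(
    "x must be numeric" = is.numeric(x),
    "something must be numeric" = is.numeric(something)
  )
  return(x + something)
}

# tryCatch() captures the error so we can look at its message
# without stopping the script.
msg <- tryCatch(
  add_something(x = 2, something = "R"),
  error = function(e) conditionMessage(e)
)
msg
```

This keeps the compact stopifnot() layout while still telling the user which argument was the problem.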

Variable Scope + Environment

Variable Scope

The location (environment) in which we can find and access a variable is called its scope.

  • We need to think about the scope of variables when we write functions.
  • What variables can we access inside a function? What variables can we access outside a function?

Global Environment

  • The top right pane of RStudio shows you the global environment.
    • This is the current state of all objects you have created.
    • These objects can be accessed anywhere.

Function Environment

  • The code inside a function executes in the function environment.
    • Function arguments and any variables created inside the function only exist inside the function.
      • They disappear when the function code is complete.
    • What happens in the function environment does not affect things in the global environment.
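
A quick way to see this for yourself: change an argument inside a function and check the original object afterwards. The names scores and double_first() are made up for this sketch.

```r
scores <- c(1, 2, 3)

# Hypothetical function: it modifies its own copy of the argument.
double_first <- function(v) {
  v[1] <- v[1] * 2
  return(v)
}

doubled <- double_first(scores)

doubled  # 2 2 3
scores   # 1 2 3 (the global object is unchanged)
```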

Function Environment

We cannot access variables created inside a function outside of the function.

add_two <- function(x) {
  my_result <- x + 2
  return(my_result)
}

add_two(9)
[1] 11
my_result
Error in eval(expr, envir, enclos): object 'my_result' not found

Name Masking

Name masking occurs when an object in the function environment has the same name as an object in the global environment.

add_two <- function(x) {
  my_result <- x + 2
  return(my_result)
}
my_result <- 2000

The my_result created inside the function is different from the my_result created outside.

add_two(5)
[1] 7
my_result
[1] 2000

Dynamic Lookup

Functions look for objects FIRST in the function environment and SECOND in the global environment.

  • If the object doesn’t exist in either, the code will give an error.
add_two <- function() {
  return(x + 2)
}

add_two()
Error in add_two(): object 'x' not found
x <- 10

add_two()
[1] 12

It is not good practice to rely on global environment objects inside a function!
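
Here is a sketch of why. The function add_two_global() is a renamed copy of the argument-free add_two() above, and the object names are only for illustration.

```r
# Relies on a global `x`: the result silently changes when `x` changes.
add_two_global <- function() {
  return(x + 2)
}

x <- 10
first_call <- add_two_global()    # looks up x = 10

x <- 100
second_call <- add_two_global()   # same call, different answer

# Safer: pass the value in as an argument.
add_two <- function(x) {
  return(x + 2)
}
explicit_call <- add_two(10)

first_call     # 12
second_call    # 102
explicit_call  # 12
```

With the argument version, everything the function needs is visible in the call itself.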

Debugging

The faces of debugging (by Allison Horst)

Debugging

You will make mistakes (create bugs) when coding.

  • Unfortunately, it becomes more and more complicated to debug your code as your code gets more sophisticated.

Debugging Strategies

  • Interactive coding
    • Highlight lines within your function and run them one-by-one to see what happens.
  • print() debugging
    • Add print() statements throughout your code to make sure the values are what you expect.
  • Rubber Ducking
    • Verbally explain your code line by line to a rubber duck (or a human).
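
As a sketch of print() debugging, here is a made-up function mean_above() with temporary print() statements added to check the intermediate values:

```r
# Hypothetical function: mean of the values above a cutoff.
mean_above <- function(x, cutoff) {
  keep <- x[x > cutoff]
  print(keep)            # check: did we keep the values we expected?

  result <- mean(keep)
  print(result)          # check: is the mean what we expect?

  return(result)
}

out <- mean_above(c(1, 5, 9), cutoff = 4)
```

Once the function behaves as expected, remove the print() statements.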

Debugging Strategies

When you have a concept that you want to turn into a function…

  1. Write a simple example of the code without the function framework.

  2. Generalize the example by assigning variables.

  3. Write the code into a function.

  4. Call the function on the desired arguments.

This structure allows you to address issues as you go.

An Example

Write a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).

  • find_car_make("Toyota Camry") should return “Toyota”.
  • find_car_make("Ford Anglia") should return “Ford”.

An Example

make <- str_extract(string = "Toyota Camry",
                    pattern = "[:alpha:]*")
make
[1] "Toyota"
make <- str_extract(string = "Ford Anglia",
                    pattern = "[:alpha:]*")
make
[1] "Ford"
car_name <- "Toyota Camry"

make <- str_extract(string = car_name, 
                    pattern = "[:alpha:]*")
make
[1] "Toyota"
find_car_make <- function(car_name){
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*")
  return(make)
}
find_car_make("Toyota Camry")
[1] "Toyota"
find_car_make("Ford Anglia")
[1] "Ford"
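
One more check worth doing at step 4: str_extract() is vectorized, so the same function should also work on a whole vector of names at once. A sketch, assuming the stringr package is installed (the input vector is chosen only for illustration):

```r
library(stringr)

find_car_make <- function(car_name){
  make <- str_extract(string = car_name,
                      pattern = "[:alpha:]*")
  return(make)
}

# One call, three car names.
makes <- find_car_make(c("Toyota Camry", "Mazda RX4 Wag", "Datsun 710"))
makes
```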

Practice Activity 7: Writing Functions

You will write several small functions, then use them to unscramble a message. Many of the functions have been started for you, but none of them are complete as is.

To do…

Thursday, February 22

Today we will…

  • New Material
    • Calling Functions on Datasets
    • Thinking About Missing Data
  • Lab 7: Functions and Fish

Calling Functions on Datasets

Last Time…

We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).

  • find_car_make("Toyota Camry") returns “Toyota”.
  • find_car_make("Ford Anglia") returns “Ford”.
find_car_make <- function(car_name){
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*")
  return(make)
}

Pair Our Function with dplyr

Consider the mtcars data.

data(mtcars)
head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Let’s use our new function:

mtcars |> 
  rownames_to_column("make_model") |> 
  mutate(make = find_car_make(make_model),
         .after = make_model) |> 
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Recall the penguins data

library(palmerpenguins)
data(penguins)
penguins |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Function to Standardize Data

We want to take in a vector of numbers and standardize it: rescale the values so they all fall between 0 and 1.

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num   <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  return(num / denom)
}
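
Before using the function on real data, it helps to try it on a tiny vector where the answer can be worked out by hand. In this sketch the vectors x and constant are made up:

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))

  num   <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)

  return(num / denom)
}

# min = 2, max = 10, so each value becomes (value - 2) / 8.
x <- c(2, 4, NA, 10)
std_x <- std_to_01(x)
std_x      # 0.00 0.25 NA 1.00

# Edge case: if all values are equal, max - min is 0 and we get NaN.
constant <- c(3, 3)
std_constant <- std_to_01(constant)
std_constant
```

Missing values stay missing, and the smallest and largest observed values map to 0 and 1.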

Pair Our Function with dplyr

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g)
         ) |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
6 Adelie  Torgersen          0.262         0.893             0.305       0.264
# ℹ 2 more variables: sex <fct>, year <int>

Ugh. Still copy-pasting!

Pair Our Function with dplyr

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g, 
                .fns = ~ std_to_01(var = .x)
                )
         ) |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
6 Adelie  Torgersen          0.262         0.893             0.305       0.264
# ℹ 2 more variables: sex <fct>, year <int>

Scaling Variables

Is it a good idea to scale (standardize) variables in a data analysis?

Why scale?

  • Easier to compare across variables.
  • Easier to model – standardizes the amount of variability.

Why not scale?

  • More difficult to interpret the values.

E.g., it is hard to picture a penguin from its standardized values: a bill length of 35 mm becomes 0.11 and a mass of 5500 g becomes 0.78.

Use variables as function arguments?

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(var = variable))
  
  return(data)
}

Note

I used the existing function std_to_01(var) inside the new function for clarity!

But it didn’t work…

std_column_01(data = penguins, variable = body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(var = variable)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

  OR

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

  OR

penguins[["body_mass_g"]]


Tidy evaluation isn’t naturally supported when writing your own functions.

Defused R Code

When a piece of code is defused, R doesn’t return its value like normal.

  • Instead it returns an expression that describes how to evaluate it.

Evaluated code:

1 + 1
[1] 2

Defused code:

expr(1 + 1)
1 + 1

Tidy evaluation works by defusing the unquoted variable names we pass in. Functions we write ourselves don’t know how to handle that defused code unless we tell them how.
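
A defused expression is an ordinary R object, so we can store it and evaluate it later. A short sketch, assuming the rlang package is installed (eval() is from base R; the name defused is only for illustration):

```r
library(rlang)

# expr() captures the code instead of running it.
defused <- expr(1 + 1)

class(defused)   # "call": this is code, not the number 2

# Evaluating the stored expression gives the usual result.
value <- eval(defused)
value            # 2
```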

Solution 1

Don’t use tidy evaluation in your own functions.

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_to_01(var = data[[variable]])
  return(data)
}

std_column_01(penguins, "bill_length_mm")

This is more complicated to read and use, but it’s safe.
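
Because the column name is now an ordinary string, it is easy to validate. A sketch that adds two extra checks to Solution 1 (the extra checks and the toy data frame are additions for illustration, not part of the original function):

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num   <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data),
            is.character(variable),        # the name must be a string
            variable %in% names(data))     # and must be a real column

  data[[variable]] <- std_to_01(var = data[[variable]])
  return(data)
}

# Made-up data for a quick check.
toy <- data.frame(id = c("a", "b", "c"), mass = c(10, 20, 30))
result <- std_column_01(toy, "mass")
result$mass   # 0.0 0.5 1.0
```

A misspelled column name now fails right away with a clear condition instead of an error from deeper inside the function.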

Solution 2

Use embrace injection.

  • The rlang package provides the embrace operator ({{ }}) to simplify writing functions around tidyverse pipelines.

  • With the {{ }} operator, you can transport a variable from one function to another and can get around defused code!
  • Alternatively, you can use enquo(arg) to defuse and !!arg to inject.
  • Read more here.

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}
std_column_01(data = penguins, variable = body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
  • The code is defused, so mutate() doesn’t know what body_mass_g is.
  • We can defuse with enquo(variable) and inject with !!variable.
  • Or we can embrace inject in the body_mass_g variable using {{ }}!

The Walrus Operator: :=

When we inject a variable name on the left-hand side (to name the column we are creating), we also have to use the walrus operator := instead of =.

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  variable <- enquo(variable)

  data <- data |>
    mutate(!!variable := std_to_01(!!variable)
           )
  
  return(data)
}

std_column_01(data = penguins, variable = body_mass_g) |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181       0.292
2 Adelie  Torgersen           39.5          17.4               186       0.306
3 Adelie  Torgersen           40.3          18                 195       0.153
4 Adelie  Torgersen           NA            NA                  NA      NA    
5 Adelie  Torgersen           36.7          19.3               193       0.208
6 Adelie  Torgersen           39.3          20.6               190       0.264
# ℹ 2 more variables: sex <fct>, year <int>
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate({{ variable }} := std_to_01({{ variable }})
           )
  return(data)
}

std_column_01(data = penguins, variable = body_mass_g) |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181       0.292
2 Adelie  Torgersen           39.5          17.4               186       0.306
3 Adelie  Torgersen           40.3          18                 195       0.153
4 Adelie  Torgersen           NA            NA                  NA      NA    
5 Adelie  Torgersen           36.7          19.3               193       0.208
6 Adelie  Torgersen           39.3          20.6               190       0.264
# ℹ 2 more variables: sex <fct>, year <int>
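
The walrus operator also lets us build a new column name from the injected variable, using glue-style syntax inside a string. A sketch, assuming dplyr is installed; the function name add_std_column() and the "_std" suffix are made up for illustration, and mtcars stands in for penguins so the example runs without extra packages:

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num   <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Hypothetical variant: keep the original column and add a
# standardized copy named <variable>_std.
add_std_column <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate("{{ variable }}_std" := std_to_01({{ variable }}))

  return(data)
}

result <- add_std_column(data = mtcars, variable = mpg)
names(result)
```

Here the original mpg column is left alone and the standardized values land in a new mpg_std column.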

Inject Multiple Variables

What if I want to modify multiple columns?

  • Use across()!
std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{ variables }}, 
                  .fns = ~ std_to_01(var = .x)
                  )
           )
  
  return(data)
}

std_column_01(data = penguins, variables = bill_length_mm:body_mass_g) |> 
  head(n = 5)
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
# ℹ 2 more variables: sex <fct>, year <int>
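
Because the embraced argument is handed to across(), it should accept any column selection that across() understands, not only a range of names. A sketch, assuming dplyr is installed and using the built-in iris data so it runs without extra packages (the object names are for illustration):

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num   <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate(across(.cols = {{ variables }},
                  .fns = ~ std_to_01(var = .x)))

  return(data)
}

# Select by type: every numeric column is standardized.
iris_all <- std_column_01(data = iris, variables = where(is.numeric))

# Select by name: only the listed columns are standardized.
iris_two <- std_column_01(data = iris,
                          variables = c(Sepal.Length, Petal.Width))

head(iris_all, n = 3)
```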

Interesting reads

Article on How Building Functions with Variable Names has Changed Over the Years

rlang Article on Data Masking

Missing Data

Types of Missing Data

  1. Missing Completely at Random (MCAR)
    • No difference between missing and observed values.
    • Missing observations are a random subset of all observations.
  2. Missing at Random (MAR)
    • Systematic difference between missing and observed values, but can be entirely explained by other observed variables.
  3. Missing Not at Random (MNAR)
    • Missingness is directly related to the unobserved value.

Types of Missing Data

Consider a study of depression.

  1. Missing Completely at Random (MCAR)
    • Some subjects have missing lab values because a batch of samples was processed improperly.
  2. Missing at Random (MAR)
    • Subjects who identify as male are less likely to complete a survey on depression severity.
  3. Missing Not at Random (MNAR)
    • Subjects with more severe depression are less likely to complete a survey on depression severity.

When we remove missing data…

We implicitly assume observations are missing completely at random!

  • We might be mostly removing data from subjects who identify as male.
  • We might be mostly removing data from subjects with severe depression.
  • We are inadvertently making our data less representative.

We need to take more care when dealing with missing values!
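
A tiny made-up example shows how much this can matter. The severity scores below are invented for illustration, and we pretend that everyone scoring 8 or higher skipped the survey (MNAR):

```r
# Hypothetical "true" severity scores for 10 subjects.
severity <- 1:10

# MNAR: the most severe cases are the ones that go missing.
observed <- severity
observed[severity >= 8] <- NA

true_mean     <- mean(severity)                 # 5.5
observed_mean <- mean(observed, na.rm = TRUE)   # 4

true_mean
observed_mean
```

Dropping the missing values makes the group look less severe than it really is.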

Dealing with Missing Data

  • Look for patterns!
    • Do observations with missing values have similar traits?
  • Consider outside explanations!
    • Why might missing data exist?
    • Should we have a “missing” category in our analysis?
  • Can we impute values?
    • If serotonin levels are MAR within sex at birth and country, then the distribution of serotonin levels will be similar for individuals of the same sex at birth and country.
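
Looking for patterns can start with a simple count of missing values by group. A sketch, assuming dplyr is installed; the survey data frame is invented for illustration and is not real study data:

```r
library(dplyr)

# Made-up survey responses; NA means the subject did not answer.
survey <- data.frame(
  sex      = c("F", "F", "F", "M", "M", "M"),
  severity = c(3, NA, 5, NA, NA, 2)
)

# How many values are missing, and what share, within each group?
missing_by_sex <- survey |>
  group_by(sex) |>
  summarise(n            = n(),
            n_missing    = sum(is.na(severity)),
            prop_missing = mean(is.na(severity)))

missing_by_sex
```

If the share of missing values differs a lot between groups, the data are probably not missing completely at random.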

To do…

  • Lab 7: Functions + Fish
    • Due Monday, 2/26 at 11:59pm
  • Read Chapter 8: Functional Programming
    • Check-in 8.1 due Tuesday 2/27 at 8:00am